HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Improving the quality of Inverse text normalization
based on neural network and numerical entities
recognition
PHAN TUAN ANH
Anh.PT211263M@sis.hust.edu.vn
School of Information and Communication Technology
Supervisor: Associate Professor Le Thanh Huong
Supervisor’s signature
School: Information and Communication Technology
May 15, 2023
SĐH.QT9.BM11 – Issued for the first time on 11/11/2014
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
CONFIRMATION OF MASTER THESIS REVISIONS
Full name of the thesis author: Phan Tuấn Anh
Thesis topic: Improving the quality of inverse text normalization based on
neural networks and numerical entity recognition.
Major: Data Science and Artificial Intelligence (Elitech)
Student ID: 20211263M
The author, the scientific supervisor, and the Thesis Examination Committee
confirm that the author has revised and supplemented the thesis according to
the minutes of the Committee meeting on 22/4/2023, with the following contents:
1. Corrected spelling errors and reviewed the wording and layout of the thesis.
2. Removed descriptive text from the captions of figures and tables.
3. Replaced the old Figures 3.1 and 3.2 with new Figures 3.1 and 3.2.
4. Replaced Figure 1.1 (old) with Figure 1.1 (new) to describe the role of the
ITN module in a speech processing system.
5. Added a detailed accuracy evaluation of each module (in Section 4.4.1.1).
6. Added an analysis of the errors encountered on the Vietnamese data set (in
Section 4.4.1.2).
7. Added examples of several error types the model makes on the Vietnamese
data set (in Section 4.5).
8. Added details on building the data set for Vietnamese and clarified the label
set (in Section 3.3).
9. Removed Tables 4.2 and 4.3, replacing them with Figures 4.1, 4.2, 4.3, 4.4,
4.5, and 4.6 to better visualize the results.
10. Added remarks explaining why the Vietnamese data set gives worse results
than the English data set (in Section 4.4.1.2).
11. Revised the references (removed citations of arXiv papers).
Date ..... Month ..... Year .....
Supervisor Thesis author
CHAIR OF THE EXAMINATION COMMITTEE
Graduation Thesis Assignment
Name: Phan Tuan Anh
Phone: +84355538467
Email: Anh.PT211263M@sis.hust.edu.vn; phantuananhkt2204k60@gmail.com
Class: 21A-IT-KHDL-E
Affiliation: Hanoi University of Science and Technology
I, Phan Tuan Anh, hereby declare that this thesis on the topic "Improving the
quality of Inverse text normalization based on neural network and numerical
entities recognition" is my personal work, performed under the supervision of
Associate Professor Le Thanh Huong. All data used for analysis in this thesis
come from my own research, analyzed objectively and honestly, with a clear
origin, and have not been published in any form. I take full responsibility for
any dishonesty in the information used in this study.
Student
Signature and Name
Acknowledgement
I hope these few short lines can convey my most sincere gratitude to my
supervisor, Associate Professor Le Thanh Huong, who has guided and encouraged
me throughout the two years of my master's course. She listened to my ideas and
gave me much valuable advice on my proposal. She also pointed out the
weaknesses of my thesis, which was very helpful in perfecting it.
I would like to thank Dr. Bui Khac Hoai Nam and the other members of the NLP
team at Viettel Cyberspace Center, who have always supported me and provided me
with foundational knowledge. In particular, my leader, Mr. Nguyen Ngoc Dung,
always created favorable conditions for me to conduct extensive experiments in
this study.
Last but not least, I would like to thank my family, who play the most
important role in my life. They are constantly my motivation to accept and
overcome the challenges I face.
Abstract
Neural inverse text normalization (ITN) has recently become an emerging
approach for post-processing the output of automatic speech recognition for
readability. In particular, addressing ITN with neural network models has
achieved remarkable results, instead of relying on the accuracy of manual
rules. However, ITN is a highly language-dependent task and is especially
tricky in ambiguous languages. In this study, we focus on improving the
performance of ITN by combining neural network models and rule-based systems.
Specifically, we first use a seq2seq model to detect numerical segments (e.g.,
cardinals, ordinals, and dates) in input sentences. Then, the detected segments
are converted into written form using rule-based systems. Technically, a major
difference in our method is that we only use neural network models to detect
numerical segments, which makes it able to deal with low-resource and ambiguous
scenarios in target languages. In addition, to further improve the quality of
the proposed model, we also integrate a pre-trained language model, BERT, and a
variant of BERT (RecogNum-BERT) as initialization points for the parameters of
the encoder.
Regarding the experiments, we evaluate on two languages, English and
Vietnamese, to indicate the advantages of the proposed method. Accordingly,
empirical evaluations show promising results for our method compared with
state-of-the-art models in this research field, especially in low-resource and
complex data scenarios.
Student
Signature and Name
TABLE OF CONTENTS
CHAPTER 1. Introduction.................................................................... 1
1.1 Research background .......................................................................... 1
1.2 Research motivation............................................................................ 4
1.3 Research objective.............................................................................. 5
1.4 Related publication ............................................................................. 5
1.5 Thesis organization............................................................................. 6
CHAPTER 2. Literature Review ........................................................... 7
2.1 Related works .................................................................................... 7
2.1.1 Rule-based methods ................................................................. 7
2.1.2 Neural network model .............................................................. 8
2.1.3 Hybrid model .......................................................................... 11
2.2 Background........................................................................................ 11
2.2.1 Encoder-decoder model ............................................................ 11
2.2.2 Transformer............................................................................. 15
2.2.3 BERT...................................................................................... 18
CHAPTER 3. Methodology ................................................................... 20
3.1 Baseline model................................................................................... 20
3.2 Proposed framework ........................................................................... 20
3.3 Data creation process .......................................................................... 23
3.4 Number recognizer ............................................................................. 24
3.4.1 RNN-based and vanilla transformer-based.................................. 24
3.4.2 BERT-based ............................................................................ 25
3.4.3 RecogNum-BERT-based........................................................... 25
3.5 Number converter ............................................................................... 28
CHAPTER 4. Experiment ..................................................................... 30
4.1 Datasets ............................................................................................. 30
4.2 Hyper-parameter configurations ........................................................... 31
4.2.1 RNN-based and vanilla transformers-based configurations........... 31
4.2.2 BERT-based and RecogNum-BERT-based configurations ............ 32
4.3 Evaluation metrics .............................................................................. 33
4.3.1 Bilingual evaluation understudy (BLEU)...................... 33
4.3.2 Word error rate (WER) ............................................................. 33
4.3.3 Number precision (NP)............................................................. 34
4.4 Result and Analysis ............................................................................ 34
4.4.1 Experiments without pre-trained LM ......................................... 35
4.4.2 Experiments with pre-trained LM.............................................. 41
4.5 Visualization ...................................................................................... 43
CHAPTER 5. Conclusion ...................................................................... 44
5.1 Summary ........................................................................................... 44
5.2 Future work ....................................................................................... 45
LIST OF FIGURES
1.1 The role of Inverse text normalization module in spoken dialogue
systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.1 The pipeline of NeMo toolkit for inverse text normalization. . . . 9
2.2 The overview of the encoder-decoder architecture, with a machine
translation example (English → Vietnamese). . . . . . . . . . . 12
2.3 The overview of using an LSTM-based encoder block (left) and
the architecture of the LSTM (right). . . . . . . . . . . . . . . . 14
2.4 The illustration of the decoding process. . . . . . . . . . . . . . . . 15
2.5 The general architecture of vanilla Transformer, which is intro-
duced in [20]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 The description of scaled dot-product attention (left) and multi-
head attention (right). . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 The overview of the pre-training procedure of BERT, which is
trained on a large corpus with next sentence prediction and masked
token prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 The overview of my baseline model (the seq2seq model for the
ITN problem) [7]. . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 The general framework of the proposed method (hybrid model)
for the Neural ITN approach. . . . . . . . . . . . . . . . . . . . . . 21
3.3 The overview of the data creation pipeline for training the Num-
ber recognizer. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 The training and inference processes when applying BERT to
initialize the encoder of the proposed model. . . . . . . . . . . . 26
3.5 The overview of the training and inference process of my pro-
posed model when integrating the RecogNum-BERT. . . . . . . . 27
3.6 Our architecture for creating RecogNum-BERT. . . . . . . . . . 28
3.7 The pipeline for data preparation for fine-tuning RecogNum-BERT. 29
4.1 The comparison of models on English test set with BLEU score
(higher is better). . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 The comparison of models on English test set with WER score
(lower is better). . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 The comparison of models on English test set with NP score
(higher is better). . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 The comparison of models on the Vietnamese test set with BLEU
score (higher is better). . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 The comparison of models on the Vietnamese test set with WER
score (lower is better). . . . . . . . . . . . . . . . . . . . . . . . . 40
4.6 The comparison of models on the Vietnamese test set with NP
score (higher is better). . . . . . . . . . . . . . . . . . . . . . . . . 41
LIST OF TABLES
1.1 Examples of the ambiguous semantic problem in the Vietnamese
language. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 An example output of the Number converter. . . . . . . . . . . 29
4.1 The training size, validation size, vocabulary size, and average
sequence length of the input of my datasets. . . . . . . . . . . . . 31
4.2 The results of Number Recognizer in the validation set with BLEU
score. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Comparison between variants of the proposed method with dif-
ferent encoders for the number recognizer module: Transformer
Base, BERT, RecogNum-BERT, in BLEU score. . . . . . . . . . 42
4.4 Comparison between variants of the proposed method with dif-
ferent encoders for the number recognizer: vanilla transformer,
BERT, RecogNum-BERT in WER score. . . . . . . . . . . . . . . 42
4.5 Examples for prediction error of the baseline model in English. . . 43
4.6 Examples of error prediction of the proposed model in Vietnamese. 43
ACRONYMS
Notation Description
ASR Automatic Speech Recognition
BERT Bidirectional Encoder Representations from Transformers
Bi-LSTM Bidirectional long short-term memory
BLEU Bilingual Evaluation Understudy
CNN Convolutional neural network
end2end End to End
FST Finite state transducer
ITN Inverse text normalization
LM Language model
LSTM Long short-term memory
MT Machine translation
NLP Natural language processing
NN Neural network
NP Number precision
OOV out of vocabulary
POS Part of speech
seq2seq Sequence to Sequence
TN Text normalization
TTS Text To Speech
WER Word error rate
WFST Weighted finite state transducer
Chapter 1
Introduction
1.1 Research background
Inverse text normalization (ITN) is a natural language processing (NLP) task of
converting text in spoken form (the source sentence) to the corresponding
written form (the target sentence), and it is applied in most speech
recognition systems. Fig. 1.1 depicts the crucial role that the ITN module
plays in a dialogue system. Specifically, in this pipeline, the audio signal is
processed by the Automatic Speech Recognition (ASR) module to create text in
spoken form. Text in this form is lower-cased and contains no punctuation or
numerical tokens. Subsequently, ITN processes it to yield text in written form,
in which the numerical tokens are converted into their natural format.
Figure 1.1: The role of Inverse text normalization module in spoken dialogue
systems.
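As a toy illustration of this step (not the thesis system, whose models are described later), the sketch below rewrites a lower-cased, unpunctuated ASR transcript into written form using a purely illustrative phrase table:

```python
# Toy ITN step: map spoken-form phrases in an ASR transcript to written form.
# The phrase table is illustrative only; a real system uses neural models or
# WFST grammars rather than a hand-written dictionary.
SPOKEN_TO_WRITTEN = {
    "may fifteenth": "May 15",
    "twenty three percent": "23%",
    "two thousand": "2,000",
}

def toy_itn(spoken: str) -> str:
    written = spoken
    for phrase, norm in SPOKEN_TO_WRITTEN.items():
        written = written.replace(phrase, norm)
    return written

print(toy_itn("the deadline is may fifteenth"))  # the deadline is May 15
```

Even this trivial sketch shows why ITN is language-dependent: the phrase table (or grammar) encoding spoken-to-written conventions must be built per language.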
Additionally, text normalization (TN) is the inverse problem of ITN,
transforming text in written form into spoken form. Despite being two opposite
processes, ITN and TN are closely related, and many researchers use similar
approaches and techniques to deal with them. Nevertheless, unlike TN, for which
promising methods have been explored in recent years [1], the ITN problem has
seen few remarkable achievements and remains one of the most challenging NLP
tasks.
The conventional approach to ITN is rule-based systems. For instance, Finite
State Transducer (FST)-based models [2] have shown competitive results [3].
However, the major problem with this approach is scalability, since it requires
complex, accurate transformation rules [4]. Recently, neural ITN has become an
emerging direction in this research field, exploiting the power of neural
networks (NN) for ITN tasks.
Furthermore, due to the significant difference between written and spoken
forms, handling numbers with minimal error is a central problem in this
research field. In particular, just as humans handle numeric values, models
should perform well on two consecutive tasks: recognizing the parts that belong
to numeric values, and combining those parts into precise numbers.
Specifically, NN-based models, typically seq2seq, have achieved high
performance and become state-of-the-art models for the ITN problem [5][6][7].
Nevertheless, as mentioned above, ITN is a highly language-dependent task and
requires linguistic knowledge. In this regard, the data-hungry problem (i.e.,
low-resource scenarios) is an open issue that needs to be taken into account
when improving performance. For instance, in a data-shortage situation, models
might lack information at the training stage to recognize and transform
numerical segments, which leads to poor ITN performance.
In this study, I investigate improving the performance of neural ITN under
low-resource and ambiguous scenarios. In particular, for the number-formatting
problem, conventional seq2seq models might fail to generate a number
sequentially, character by character, which often happens with long numbers
(e.g., phone numbers or large cardinals). For example, the number 'one billion
and eight' must be converted to 10 sequential characters: '1 0 0 0 0 0 0 0 0
8'. Moreover, data scarcity in the training process can make this issue even
worse when the considered language has many semantic ambiguities between
numbers and words. For instance, in Vietnamese, the word 'không' (English
translation: no) can be the digit '0', but it is also used to express a
negative opinion. Tab. 1.1 illustrates several examples of the ambiguous
semantic problem in the Vietnamese language.
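To make the contrast concrete, the following minimal rule-based English cardinal converter (an illustrative sketch, not the thesis implementation) shows how a deterministic rule reproduces all ten digits of 'one billion and eight' exactly, whereas a character-by-character seq2seq decoder can drop or repeat digits:

```python
# Minimal rule-based converter from spoken English cardinals to integers.
# Coverage is deliberately small (no ordinals, decimals, or dates); it only
# illustrates why rules never "lose" a digit of a long number.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

def words_to_number(text: str) -> int:
    total, current = 0, 0
    for word in text.split():
        if word == "and":
            continue                       # filler word, carries no value
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
    return total + current

print(words_to_number("one billion and eight"))  # 1000000008
```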
In this thesis, the proposed framework includes two stages: i) in the first
stage, I use a neural network model to detect numerical segments in each
sentence; ii) then, the output of the first stage is converted into written
form using a set of rules. Accordingly, the main difference between my method
and previous works is that the neural network is used only to detect numerical
segments in each sentence, as the first stage. Reading numbers is handled in
the second stage by a set of rules, which can supply substantial information to
the system without requiring as much training data as end-to-end models do.
Additionally, in the first stage of my pipeline, I implement the numerical
detector as a seq2seq model, for which I also investigate the efficiency of
several conventional approaches: RNN-based and transformer-based. Besides, I
also take advantage of pre-trained language models (BERT and a variant of BERT,
RecogNum-BERT) to boost performance.
- 'tôi không thích cái bánh này' (I do not like this cake): 'không' is a word (negation).
- 'không là số tự nhiên nhỏ nhất' (zero is the smallest natural number): 'không' is the number 0.
- 'năm một nghìn chín trăm chín bảy' (the year nineteen ninety-seven): 'năm' is a word (year), while 'một nghìn chín trăm chín bảy' is the number 1997.
- 'năm mươi nghìn' (fifty thousand): 'năm' is part of the number 50,000.
- 'chín quả táo' (nine apples): 'chín' is the number 9.
- 'quả táo chín' (a ripe apple): 'chín' is a word (ripe).
- 'số ba ba' (the number thirty-three): 'ba ba' is the number 33.
- 'con ba ba' (a soft-shell turtle): 'ba ba' is a word (an animal).
- 'chú ba' (Uncle Ba, a person's name): 'ba' is a word (a name).
Table 1.1: Examples of the ambiguous semantic problem in the Vietnamese lan-
guage.
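The two-stage control flow of this framework can be sketched as follows. In the thesis, stage 1 is a seq2seq model; here it is faked with a digit lexicon so that only the "detect, then convert by rule" flow is visible. All names are illustrative:

```python
# Sketch of the hybrid pipeline: stage 1 detects numerical spans, stage 2
# rewrites each detected span deterministically. The lexicon-based detector
# is a stand-in for the neural number recognizer.
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def detect_numeric_spans(tokens):
    """Stage 1 stand-in: mark maximal runs of spoken digit words."""
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i] in DIGITS:
            j = i
            while j < len(tokens) and tokens[j] in DIGITS:
                j += 1
            spans.append((i, j))
            i = j
        else:
            i += 1
    return spans

def convert(tokens):
    """Stage 2: deterministic rules rewrite each detected span."""
    out, last = [], 0
    for i, j in detect_numeric_spans(tokens):
        out.extend(tokens[last:i])
        out.append("".join(DIGITS[t] for t in tokens[i:j]))
        last = j
    out.extend(tokens[last:])
    return " ".join(out)

print(convert("call me on zero nine one two three".split()))  # call me on 09123
```

The key design point this sketch captures is that only the detection step needs learning; the conversion step is fully controllable, so it cannot produce the unrecoverable digit errors of an end-to-end decoder.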
Generally, the main contributions of my method are as follows:
- I propose a novel hybrid approach combining a neural network with a
rule-based system, which is able to deal with ITN problems under low-resource
and ambiguous scenarios. This is the first research to conduct experiments and
analysis on this data scenario.
- I evaluate the proposed methods in two different languages, English and
Vietnamese, with promising results. Specifically, my method can easily be
extended to other languages without requiring any linguistic or grammatical
knowledge.
- I propose a novel approach for integrating the knowledge of a pre-trained
language model (BERT) to enhance the quality of my method.
- I present a novel pipeline to build a neural-network-based ITN model for
Vietnamese.
1.2 Research motivation
Regarding the research motivation, I have two remarkable points as follows:
- The data scenario I would like to consider throughout this thesis is
low-resource and ambiguous data. As I discuss in detail in the Experiment
chapter, for research purposes, almost all previous researchers reuse the data
of the TN problem [2]. Essentially, because ITN and TN are opposite problems,
the authors reverse the order of each sample in the TN data set to turn it into
a sample for ITN. Intuitively, one can suppose that there is no standard data
set for ITN tasks. Vietnamese faces the same issue, as there is no annotated
data set for ITN. Besides, building a data set for the ITN problem in any
language costs considerable workforce and time. This phenomenon may limit the
data sets available in both industry and academic environments. This is the
biggest motivation for me to consider the low-resource data scenario.
Additionally, the complexity of the data also needs to be considered. For
several languages, including Vietnamese, ambiguity becomes a serious problem
(as some examples in Table 1.1 show). This increases the difficulty of the data
that a model has to deal with. More specifically, both conventional methods,
rule-based and neural-network-based, might perform poorly on such a complex
language. As a consequence, besides the poor-resource scenario, I am also
motivated to consider the second scenario: ambiguous data.
- As mentioned above, both approaches, rule-based systems and neural network
systems, face obstacles with low-resource and complex data, and each has its
own particular downsides. Building a rule-based system is extremely complicated
and requires a great deal of expert linguistic knowledge. In addition, a
rule-based system is limited in its ability to generalize, upgrade, and extend.
For an ambiguous language such as Vietnamese, this problem is more serious, and
in some cases the rule-based system fails. Regarding neural networks, despite
their capacity to deal with the generalization and ambiguity problems by
learning contextual embeddings, this approach requires abundant data. Moreover,
because there is no mechanism to effectively control the output, using an
end2end neural network model can also produce unrecoverable errors. These
reasons encourage us to invent a new method that combines the two approaches,
harmonizing the strengths of both methods as well as eliminating their
drawbacks.
1.3 Research objective
As almost all researchers mention in existing works on both the TN and ITN
problems, the vital difference between text in written form and in spoken form
lies in numerical related entities. These objects, so-called semiotic classes,
appear in sentences in diverse forms, such as ORDINAL, CARDINAL, TELEPHONE,
NUMBER SERIES, DATE, etc. Handling these objects requires high accuracy,
because if a system handles even one number incorrectly, it can entirely change
the meaning of the sentence. Therefore, in this thesis, the central objects I
focus on are numerical entities. Besides, as mentioned in the research
motivation section, I place my method under the scenario of low-resource and
ambiguous data, and my method is a combination of the two conventional
approaches: rule-based and neural network systems.
Using the hybrid model, I want to prove the efficiency of my method on limited
and complex data. Beyond that, I also compare the performance of my model with
the baseline model in the rich-resource setting, to see the downside of my
method when data is abundant.
Due to the effectiveness of transferring knowledge from pre-trained language
models to downstream NLP tasks, I would also like to test whether this holds
for the ITN problem. Additionally, by inventing a novel variant of BERT, I
would like to test whether supplementing additional information about the
appearance of numerical entities in sentences can boost performance.
Finally, via this thesis, I hope to raise researchers' awareness of this
problem as well as of my scenarios. I hope the method presented in my thesis is
useful for real production.
1.4 Related publication
Phan, T. A., Nguyen, N. D., Thanh, H. L., Bui, K. H. N. (2022, December).
Neural Inverse Text Normalization with Numerical Recognition for Low Resource
Scenarios. In Intelligent Information and Database Systems: 14th Asian
Conference, ACIIDS 2022, Ho Chi Minh City, Vietnam, November 28–30, 2022,
Proceedings, Part I (pp. 582–594). Cham: Springer International Publishing.
(Accepted)
1.5 Thesis organization
My thesis is organized into five main chapters. In the first chapter, I
introduce the fundamental definition of the inverse text normalization (ITN)
problem. In my research background, I supply the basic knowledge as well as the
important role that ITN plays in spoken dialogue systems. The following parts
are the literature review, methodology, experiments and results, and conclusion
and future work, respectively. In particular, the remaining chapters are
organized as follows:
Chapter 2 presents the literature review. This chapter is divided into two
subparts: related work and background. In related work, I summarize the
existing work dealing with the ITN and TN problems, categorized into three main
approaches. The background section provides the fundamental knowledge about the
backbone of my methodologies, such as the encoder-decoder architecture,
LSTM-based and Transformer-based seq2seq models, and an overview of BERT.
Chapter 3 reveals the details of my methodologies. In this chapter, I introduce
the overview of the baseline model, the proposed model, the pipeline for
creating seq2seq training data, and the LSTM-based, transformer-based, BERT,
and one variant of BERT applied to the encoder-decoder model. At the end of
this chapter, I describe how I build the rule-based system for both English and
Vietnamese, which plays a crucial role in the whole system.
Chapter 4 first presents the important traits of my data and the details of the
configurations used in my experiments. Besides, this chapter also provides a
thorough analysis and comparison between my method and the baseline.
Chapter 5 summarizes my contributions in this thesis. In this chapter, I also
give some interesting ideas that I plan to investigate in the future.
Chapter 2
Literature Review
2.1 Related works
The research on ITN is closely related to the TN problem, and recent works can
be classified into three main approaches, as follows:
- Rule-based methods, which mainly leverage the grammar rules of a particular
language.
- Deep learning methods, which consider the ITN task as a machine translation
task. The architecture taken into account is the seq2seq model.
- Hybrid models, which can deal with the disadvantages of both aforementioned
methods.
2.1.1 Rule-based methods
Conventional ASR and Text-to-Speech (TTS) systems are built upon weighted FST
grammars for the TN and ITN problems.
Kestrel is a component of the Google TTS synthesis system [2] that concentrates
on solving the TN problem. Kestrel first categorizes the numerical entities in
text into multiple semiotic classes: measures, percentages, currency amounts,
dates, cardinals, ordinals, decimals, fractions, times, telephone numbers, and
electronic addresses. Subsequently, it uses a protocol buffer whose basic unit
is the message. A message is essentially a dictionary, which consists of named
keys and certain values. The values may include integers, strings, booleans, or
even other nested messages. The whole process can be divided into two stages:
classification and verbalization. While the classifier is responsible for
recognizing, via WFST grammars, which semiotic class the corresponding token
belongs to, the verbalizer receives the message and converts it into the right
form. In terms of evaluation results, Kestrel achieves promising accuracy in
both English and Russian. In particular, for both languages, it reaches
virtually perfect accuracy for several semiotic classes (cardinal, date,
decimal, electronic, etc.) and 99% on the Google TN test data set [4].
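As a hedged illustration of this classify-then-verbalize design, the sketch below uses a plain Python dict in place of the protocol-buffer message and reduces the grammar to cardinals below one hundred; everything here is illustrative and far simpler than Kestrel's real WFST grammars:

```python
# Two-stage TN flow in the style described above: classification builds a
# "message" naming the semiotic class, verbalization renders it as speech.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def classify(token):
    """Stage 1: build a message naming the semiotic class of the token."""
    if token.isdigit() and int(token) < 100:
        return {"class": "CARDINAL", "value": int(token)}
    return {"class": "PLAIN", "text": token}

def verbalize(message):
    """Stage 2: render the message in spoken form."""
    if message["class"] == "CARDINAL":
        n = message["value"]
        if n < 20:
            return ONES[n]
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("" if rest == 0 else " " + ONES[rest])
    return message["text"]

print(" ".join(verbalize(classify(t)) for t in "i ate 42 apples".split()))
# i ate forty two apples
```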
For ITN, the set of rules in [5] achieves 99% on internal data from a virtual
assistant application. Recently, in order to open a new path for seamless
production development, Zhang et al. introduced an open-source Python
WFST-based library [3]. Figure 2.1 illustrates the system. Compared to
Kestrel's pipeline, this flow is somewhat similar in the way the authors define
the semiotic classes, except that the process is reversed. NeMo ITN also
includes a two-stage normalization pipeline that first detects semiotic tokens
(classification) and then converts them to written form (verbalization). Both
stages consume a single WFST grammar. The major problem is that this approach
requires significant effort and time to scale the system across languages.
Nevertheless, by using the Python library Pynini to formulate and compile
grammars, NeMo ITN can easily add new rules, modify an existing class, and even
add an entirely new semiotic class. This is a huge advantage for deploying it
in a production environment. In terms of results, the NeMo toolkit obtains an
exact match of 98.5% for CARDINAL and 78.65% for DECIMAL on the cleaned data
set.
Overall, when using grammar rules such as WFSTs, we do not need an annotated
data set. However, this method incurs significant costs for language experts
and makes the model challenging to scale.
2.1.2 Neural network model
Recurrent Neural Network (RNN)-based seq2seq models [8] have been adopted to
reduce manual processes.
For the TN problem, Sproat et al. [9] consider this problem as a machine
translation task and develop an RNN-based seq2seq model trained on window-based
data. Specifically, an input sentence is regarded as a sequence of characters,
and the output sentence is a sequence of words. Furthermore, due to the input
sequence length problem, they split a sentence into chunks with a window size
of three to create training samples, in which normalized tokens are marked by a
distinctive begin tag <norm> and end tag </norm>. In this regard, this approach
limits the number of input and output nodes to something reasonable. Their
neural network architecture closely follows that of [10].
Figure 2.1: The pipeline of the NeMo toolkit for inverse text normalization.
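The window-based sample creation described above might be sketched as follows, under the assumption that a window size of three means the marked token plus one token of context on each side; the `<norm>`/`</norm>` tag format follows the text, and the rest is illustrative:

```python
# Create one training sample per token: `context` tokens of left and right
# context, with the centre token wrapped in <norm> ... </norm> tags.
def make_windows(tokens, context=1):
    samples = []
    for i, tok in enumerate(tokens):
        left = tokens[max(0, i - context):i]
        right = tokens[i + 1:i + 1 + context]
        samples.append(" ".join(left + ["<norm>", tok, "</norm>"] + right))
    return samples

for sample in make_windows("flight 370 vanished".split()):
    print(sample)
# <norm> flight </norm> 370
# flight <norm> 370 </norm> vanished
# 370 <norm> vanished </norm>
```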
Subsequently, Sevinj et al. [11] proposed a novel end-to-end Convolutional
Neural Network (CNN) architecture with residual connections for the TN task. In
particular, they consider the TN problem as a classification problem with two
stages: i) first, the input sentence is segmented into chunks, similar to [9],
and a CNN-based model labels each chunk with its corresponding class based on
the scores of a softmax function; ii) after the classification stage, they
apply rule-based methods depending on each class. Rather than using grammar
rules, leveraging the CNN model proves efficient, with approximately 99.44%
accuracy over all semiotic classes.
Mansfield et al. [12] take advantage of an RNN-based seq2seq model to deal with
the TN problem. In particular, both input and output are processed using
subword units to overcome the OOV issue; numerical tokens are even tokenized at
the character level. In addition, linguistic features are taken into account to
enhance the quality of the neural machine translation models: 1) capitalization
(upper, lower, mixed, non-alphanumerical, foreign characters); 2) position
(beginning, middle, end, singleton); 3) POS tags; and 4) labels. To integrate
the linguistic features with the subword units, a concatenate or add operator
is utilized, and the combined embedding is fed into a Bi-LSTM encoder. When
compared with the window-size-based method as a baseline, the full model
achieves a sentence error rate (SER) of only 0.78%, a word error rate (WER) of
0.17%, and a BLEU score of approximately 99.73%.
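Two of these hand-crafted feature groups (capitalization and position) can be sketched as below; POS tags and labels would come from external annotators and are omitted. The category names follow the list above, while the exact mapping logic is my assumption:

```python
# Extract per-token capitalization and position features, as categorical
# strings that would later be embedded and combined with subword embeddings.
def capitalization_feature(token):
    if not any(ch.isalnum() for ch in token):
        return "non-alphanumerical"
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    return "mixed"  # e.g. "McDonald"; "foreign characters" handling omitted

def position_feature(i, n):
    if n == 1:
        return "singleton"
    if i == 0:
        return "beginning"
    if i == n - 1:
        return "end"
    return "middle"

tokens = ["NASA", "launched", "McDonald", "!"]
for i, t in enumerate(tokens):
    print(t, capitalization_feature(t), position_feature(i, len(tokens)))
```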
For the ITN problem, Sunkara et al. [7] also cast ITN as a machine translation
task. Inspired by the subword tokenizer methods in [12], they first tokenize
sentences using the SentencePiece toolkit [13], then feed the embedded features
to the encoder. The output of the decoder is recovered through several
post-processing steps. Their proposed architecture uses both RNN-based and
Transformer-based encoder-decoder models with a copy attention mechanism in
decoding. The result comparisons show that the Transformer-based method
performs best across various domains of the test set, with only 1.1% WER for
Wikipedia and 2.0%, 2.4%, and 1.5% for CNN, Daily Mail, and News-C,
respectively. Additionally, given the impressive outcome of using neural
networks for ITN, the authors also consider using pre-trained language model
information to boost performance. They find that using pre-trained models such
as BART [14] to initialize the encoder and decoder does not yield good
performance, while using BERT [15] to extract contextual embeddings and fuse
them into each layer of the transformer, as in [16], brings a small benefit.
Beyond English, their model also achieves good WER scores in German, Spanish,
and Italian: 2.1%, 5%, and 1.6%, respectively.
In conclusion, methods that use a neural network for both the TN and ITN tasks
free the model from the heavy cost of language-expert knowledge as well as from
complex model structures. These methods are easy to extend to multiple languages
and to scale to large systems. Nevertheless, implementing them also faces
several challenges:
- Training neural networks requires a large amount of labeled data. In practice,
most research on the ITN problem creates training pairs by taking the Google TN
data set and swapping input and output. This raises concerns about the quality
of the model in low-resource scenarios.
- Because seq2seq is used as an end-to-end model, the neural network can
generate unrecoverable errors. When a semiotic class such as a numerical entity
is handled incorrectly, the error can cause severe problems in real
applications.
2.1.3 Hybrid model
Both using WFST and neural networks for ITN have downsides of their own.
WFST-based methods strongly depend on the volume and accuracy of a set of
grammar rule that language experts can provide, and in some cases, they are
not able to cover all situations. Meanwhile, using neural network consume a
great of annotated data and suffer unrecoverable errors. Therefore, many existing
approaches combine two aforementioned methods to overcome their weakness of
them.
Pusateri et al. [5] present a data-driven approach to the ITN problem that uses
a set of simple rules and a few hand-crafted grammars to cast ITN as a labeling
problem. A bi-directional LSTM model is then adopted to solve the resulting
classification problem.
Sunkara et al. [7] propose a hybrid approach combining Transformer-based seq2seq
models with FST-based text normalization techniques, in which the output of the
neural ITN model is passed through an FST. They use a confidence score emitted
by the neural model to decide whether the system should use the neural ITN
output. Intuitively, the confidence score acts as a filter that switches the
overall system to the FST output rather than the neural output when the model
encounters an unrecoverable error. Strictly speaking, this approach is not a
true combination of the two approaches, because the final output is produced by
only one of the two models.
2.2 Background
In this section, I briefly introduce several crucial pieces of background
knowledge related to this work: the encoder-decoder model, the Transformer
architecture, and the pre-trained language model BERT. All of them are
significant units that constitute my proposed model.
2.2.1 Encoder-decoder model
The encoder-decoder model was first presented by Sutskever et al. in [8], aiming
to solve seq2seq problems. A seq2seq problem is generally an NLP task in which
both the input and the output are sequences of tokens, such as machine
translation, abstractive summarization, text generation, and text normalization.
Intuitively, an encoder-decoder model is made of two main components: an encoder
and a decoder. Each component is further constituted from smaller units, called
blocks: encoder blocks and decoder blocks. Each encoder block receives the output
of the previous block as input, tries to capture information to create a hidden
state, and forwards it to the following block. The hidden state can be understood
as an encoding of the information in vector space. The encoded information from
the encoder blocks is then passed through the decoder blocks, which decode the
list of tokens sequentially. At each decoding step, a decoder block must use
information from both the previous block and the encoder block to predict the
next token, until a stopping condition is reached. The final result is the
combination of all predicted tokens. Figure 2.2 shows the overview of the
encoder-decoder architecture for a machine translation problem (English to
Vietnamese), in which the source sentence in English is passed through an encoder
constructed from multiple stacked encoder blocks. The output of each lower block
is used as the input of the next one. Finally, the encoder obtains the context
vector by capturing the intra-sentence relations between the elements of the
source sentence. The context vector is then fed into the decoder, which produces
the final result: the target sentence in Vietnamese.
In the next part of this section, I review in detail how the Long Short-Term
Memory (LSTM) network is applied to the encoder-decoder model, as presented in [8].
Figure 2.2: The overview of the encoder-decoder architecture for a machine
translation example (English to Vietnamese).
Encoder: The LSTM, introduced in [17], is a recurrent neural network
architecture. It introduces three new types of gates: the input gate regulates
the amount of incoming information, the forget gate decides how much information
is discarded at the current step, and the output gate controls which part of the
current state is output. These gates give the LSTM the ability to deal with the
vanishing gradient of the conventional RNN and to capture information from
time-series data effectively. With the LSTM as the encoder block, I denote the
input string as s = (s_i)_{i=1}^{N}, where s_i indicates the i-th token. In
encoder block l, x_i is the embedding of token i, i.e., one vector of a learnable
embedding matrix, and h_i is the hidden state of the model at time step i. The
value of h_i can be computed from h_{i-1} and x_i as follows:
f_i = \sigma_g(W_f x_i + U_f h_{i-1} + b_f)   (2.1)
\mathrm{input}_i = \sigma_g(W_{\mathrm{input}} x_i + U_{\mathrm{input}} h_{i-1} + b_{\mathrm{input}})   (2.2)
\mathrm{output}_i = \sigma_g(W_{\mathrm{output}} x_i + U_{\mathrm{output}} h_{i-1} + b_{\mathrm{output}})   (2.3)
\hat{c}_i = \sigma_c(W_c x_i + U_c h_{i-1} + b_c)   (2.4)
c_i = f_i \odot c_{i-1} + \mathrm{input}_i \odot \hat{c}_i   (2.5)
h_i = \mathrm{output}_i \odot \sigma_h(c_i)   (2.6)
where, for initialization, the hidden state at time step 0 is h_0 = 0; the
symbol \sigma_g denotes the sigmoid function, with 0 \le \sigma_g(x) \le 1;
\mathrm{input}_i and \mathrm{output}_i represent the values of the input gate
and the output gate, which regulate the amount of information used and
discarded, respectively; and the symbol \sigma_h denotes the tanh function.
Now h_i is the output for token i at this time step. For the following layer
(l+1), h_i^l becomes the new input embedding of token i: x_i^{l+1} = h_i^l, and
the same process is repeated until the last block is reached. Finally, with an
encoder containing k blocks, the final hidden states are
h^k = (h_1^k, h_2^k, \ldots, h_N^k).
These vectors are also called context vectors. The whole encoding process aims
to extract valuable information based on the characteristics of each token x_i
and on the sequential structure of the input. In practice, instead of using only
a one-directional LSTM as above, researchers usually take advantage of both
directions of the sequence, using a bi-directional LSTM (Bi-LSTM) as the
encoder. Essentially, a Bi-LSTM has the same architecture as an LSTM, except
that the information from the forward direction (left to right) and the backward
direction (right to left) is combined by concatenation:
h_{\mathrm{BiLSTM}} = [\overrightarrow{h}_{\mathrm{LSTM}}; \overleftarrow{h}_{\mathrm{LSTM}}].
The overview of the encoder with multiple blocks and the integration of the LSTM
model into the encoder is given in figure 2.3.
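As an illustration, one step of the gate equations (2.1)-(2.6) can be sketched
in plain Python; the toy dimensions and weight values are assumptions for
demonstration only:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM step following Eqs. (2.1)-(2.6).
    W, U, b are dicts keyed by gate name: 'f', 'input', 'output', 'c'."""
    n = len(h_prev)
    def gate(name, act):
        # act(W x + U h_prev + b), elementwise over the hidden dimension
        return [act(sum(W[name][j][k] * x[k] for k in range(len(x)))
                    + sum(U[name][j][k] * h_prev[k] for k in range(n))
                    + b[name][j])
                for j in range(n)]
    f = gate('f', sigmoid)            # forget gate, Eq. (2.1)
    inp = gate('input', sigmoid)      # input gate, Eq. (2.2)
    out = gate('output', sigmoid)     # output gate, Eq. (2.3)
    c_hat = gate('c', math.tanh)      # candidate cell state, Eq. (2.4)
    c = [f[j] * c_prev[j] + inp[j] * c_hat[j] for j in range(n)]  # Eq. (2.5)
    h = [out[j] * math.tanh(c[j]) for j in range(n)]              # Eq. (2.6)
    return h, c
```

Stacking such cells and feeding h back in as the next step's h_prev reproduces
the recurrence described above.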
Figure 2.3: The overview of using an LSTM as the encoder block (left) and the
architecture of the LSTM (right).
Decoder: Opposite to the functionality of the encoder, the decoder aims to
transform the context vectors from vector space into the output. Similar to the
architecture of the encoder block, the decoder block can also easily be
implemented with an LSTM. A Bi-LSTM is not considered here because the decoding
process has only one direction: from left to right. In the work in [8], the
decoder takes only the last context vector, h_N^k, as the initialization of the
hidden state of the first step, rather than the zero vector as in the encoder.
With a decoder containing K blocks, the decoding process is performed
sequentially as follows:
At time step j, to avoid confusion, I denote h_N^k = h_{\mathrm{encoder}}. The
hidden state of layer k is computed from the previous hidden state and the input
from layer (k-1), x_j^{k-1}:

h_j^k = \mathrm{LSTM}(h_j^{k-1}, (h_{\mathrm{encoder}}, h_{\mathrm{encoder}}))   (2.7)
h_j^0 = x_j^0 = E_{y_j}   (2.8)
h_0^0 = E_{\mathrm{START}}   (2.9)

The final hidden state h_j^K is passed into a softmax layer over the entire
vocabulary to find the most probable next token:

p = \mathrm{softmax}(W_p h_j^K)   (2.10)
y_{j+1} = \arg\max(p)   (2.11)
The aforementioned process is executed until the model emits the END token or
reaches the sequence-length limit. The overview of using the LSTM for the
decoder is illustrated in figure 2.4.
Figure 2.4: The illustration of the decoding process.
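The greedy decoding loop of Eqs. (2.10)-(2.11) can be sketched with a toy step
function; the vocabulary and transition probabilities below are invented for
illustration:

```python
def greedy_decode(step_fn, start_state, start_token, end_token, max_len=10):
    """Greedy decoding: repeatedly pick the argmax token (Eq. 2.11)
    until the END token is produced or the length limit is reached."""
    state, token, output = start_state, start_token, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)   # Eq. (2.10): distribution over vocab
        token = max(probs, key=probs.get)      # Eq. (2.11): argmax
        if token == end_token:
            break
        output.append(token)
    return output

# Toy step function over a tiny vocabulary (hypothetical probabilities).
def toy_step(token, state):
    table = {
        "<s>": {"hello": 0.9, "world": 0.05, "</s>": 0.05},
        "hello": {"world": 0.8, "hello": 0.1, "</s>": 0.1},
        "world": {"</s>": 0.95, "hello": 0.05},
    }
    return table[token], state

print(greedy_decode(toy_step, None, "<s>", "</s>"))  # -> ['hello', 'world']
```

In a real decoder, step_fn would run the stacked LSTM blocks and the softmax
layer; here it is a lookup table.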
To enhance the quality of the decoding process, [18] and [19] introduce
different ways of applying attention mechanisms to improve the alignment ability
of the model. Thanks to this mechanism, the decoder can decide which tokens on
the encoder side should contribute more to predicting the next token.
2.2.2 Transformer
The Transformer is a recent seq2seq architecture that has achieved high
performance on most NLP tasks [20]. Accordingly, the work in [7] has shown the
advantage of the Transformer over RNN-based models on the ITN problem. Fig. 2.5
depicts the general architecture of the Transformer. In particular, the
Transformer rebuilds the encoder-decoder architecture from stacked
self-attention and point-wise fully connected layers in both the encoder and
the decoder.
Encoder: The encoder consists of 6 stacked encoder layers. The basic unit of
each layer is the sub-layer (sub-block): the first sub-block is multi-head
self-attention, and the other is a 2-layer feed-forward network. Each sub-layer
also employs a residual connection and layer normalization. The dimension of the
embedding vector is set to d_model = 512.
Decoder: The decoder is also composed of 6 stacked decoder layers. In each
decoder layer, the authors add an intermediate sub-layer that allows the decoder
to attend to the outputs of the encoder. As in the encoder, the first and last
sub-layers are still multi-head self-attention and a point-wise feed-forward
layer. A masking mechanism is used to prevent wrong attention to subsequent
positions on the decoder side and to padding-token positions on the encoder
side.
Figure 2.5: The general architecture of the vanilla Transformer, introduced
in [20].
The biggest difference, and also the source of strength, of the Transformer
compared to the RNN-based encoder-decoder is multi-head attention. In
particular, by introducing three types of matrices (Query, Key, and Value) and
scaled dot-product attention, this layer outputs the new hidden state of each
token as a weighted sum over all considered tokens. In self-attention, each
token decides the level of relevance between itself and the other tokens around
it. In cross-attention, a token being decoded decides which tokens on the
encoder side are most relevant to it. Scaled dot-product attention and
multi-head attention are illustrated in figure 2.6.
Figure 2.6: The description of scaled dot-product attention (left) and
multi-head attention (right).
The detailed formulation of the scaled dot-product attention layer at head i can
be described as follows:
x^{(i)} = \mathrm{Attention}\big(x = (x_1, x_2, \ldots, x_N), W_i^Q, W_i^K, W_i^V\big)   (2.12)
        = \mathrm{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i   (2.13)
where x and x^{(i)} are the respective input and output of the attention layer.
W_i^Q \in \mathbb{R}^{d \times d_k}, W_i^K \in \mathbb{R}^{d \times d_k}, and
W_i^V \in \mathbb{R}^{d \times d_k} are learnable parameters. Q_i = x W_i^Q,
K_i = x W_i^K, and V_i = x W_i^V are the query matrix, key matrix, and value
matrix of head i, respectively. d and d_k denote the hidden sizes of the model
and of head i. The scaling factor 1/\sqrt{d_k} is proposed to protect the model
from gradients that are too large or too small.
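As an illustration, Eq. (2.13) can be sketched in plain Python with small
list-based matrix helpers; the 2x2 matrices below are toy values:

```python
import math

def matmul(A, B):
    """Naive matrix product over lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    """Row-wise softmax with the usual max-subtraction for stability."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in Eq. (2.13)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[v / math.sqrt(d_k) for v in row] for row in matmul(Q, K_T)]
    return matmul(softmax_rows(scores), V)

# Two tokens, d_k = 2 (toy values): each output row is a convex
# combination of the rows of V.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(Q, K, V)
```

Because the softmax weights sum to one, every output row stays inside the range
spanned by the value rows, which is what "weighted sum of all considered tokens"
means concretely.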
Using multi-head attention allows the model to jointly attend to multiple types
of information at different positions:

x' = \mathrm{MultiHead}(x, W^Q, W^K, W^V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_k) W^O   (2.14)

where \mathrm{head}_i = \mathrm{Attention}(x, W_i^Q, W_i^K, W_i^V)   (2.15)
Finally, the log conditional probability of the output sequence can be written
as follows:

\log P(y \mid x) = \sum_{t=1}^{N} \log P(y_t \mid y_{<t}, x)   (2.16)
2.2.3 BERT
BERT stands for Bidirectional Encoder Representations from Transformers. BERT
follows the architecture of the Transformer encoder, is designed to be
pre-trained on unannotated data, and is able to combine both left and right
contexts in all layers. As with other pre-trained language models, BERT was
created to be applied to downstream NLP tasks. BERT learns a contextual
embedding of each token through two mechanisms: masked token prediction and next
sentence prediction. The overview of the training process of BERT is given in
figure 2.7. To fulfill both tasks simultaneously, BERT must investigate the
contextual embeddings of all tokens of the input sentence. Being trained on a
large corpus, BERT collects a huge amount of knowledge about a particular
language and is very useful for enhancing the performance of downstream NLP
tasks.
Figure 2.7: The overview of the pre-training procedure of BERT, which is trained
on a large corpus with next sentence prediction and masked token prediction.
Here, I focus only on masked token prediction. During training, a certain
fraction of the tokens in the original input is masked. At the final layer, the
hidden states at the corresponding positions are passed through a softmax layer
over the vocabulary to predict the masked tokens precisely. Intuitively, to
perform this task well, a model must learn the interactions between co-occurring
tokens in a sentence; in other words, it must learn the entire context of the
input sentence. Thanks to capturing all the context of sentences, transferring
knowledge from BERT to another model for NLP downstream tasks can significantly
advance performance.
For the seq2seq model, and especially for machine translation, incorporating
BERT is considered in detail in [21]. In this research, the authors introduce
three ways to utilize BERT to enhance the performance of a machine translation
model:
- Use BERT to initialize the encoder of an NMT model, similar to [15].
- Use BERT embeddings as inputs to the NMT model. Inspired by the study in [22],
the authors take the output of the last layer of BERT and feed it to the inputs
of the NMT model.
- Extending the second way, the authors also leverage the output of BERT for
richer information: the BERT embeddings are incorporated directly into each
layer of both the encoder and the decoder.
Despite some promising results, the authors also reveal remarkable limitations
of leveraging BERT in an NMT model: it raises the storage cost and increases the
inference time.
Chapter 3
Methodology
3.1 Baseline model
In this section, I revisit my baseline model, which was built based on the model
presented in [7]. The overview of this model is shown in figure 3.1. Basically,
this model contains only one encoder-decoder module; the details of its
operation are given in Chapter 2. Specifically, in the output of the decoder,
the numerical entities are split into sequences of characters. In the final
stage, the output is fed through a post-processing step that removes redundant
spaces and concatenates the characters to form the correct number. The
encoder-decoder can be implemented with two well-known backbones: RNN-based and
Transformer-based architectures. For comparison, I implement my proposed method
with the same type of data and the same backbone models, and examine whether the
proposed model is beneficial compared with the baseline. In this thesis, I focus
only on neural-network-based models and do not compare my results with WFST
models.
3.2 Proposed framework
In this thesis, I propose a novel hybrid model for the neural ITN problem that
combines a seq2seq neural network with a rule-based system, considering the ITN
task as a Machine Translation (MT) task. Fig. 3.2 describes my general
framework, which includes two main stages.
Figure 3.1: The overview of my baseline model (the seq2seq model for the ITN
problem) [7].
Specifically, in the first stage, each sentence is put into a transformer-based
seq2seq model that detects numerical segments using the tags < n > and < /n >,
in which n represents a numerical class (e.g., DATE, CARDINAL, ORDINAL). I call
this module the Number recognizer. Then, a set of rules is employed to convert
the tokens wrapped by the tags into their written form, while all parts of the
sentence outside the tags are preserved. I call this module the Number
converter.
Figure 3.2: The general framework of the proposed method (hybrid model) for the
neural ITN approach.
In particular, instead of using a neural network to translate numbers directly
as in [7], I use the NN only to detect numerical segments. This idea is closely
related to the study in [11], in which the authors also divide the TN process
into two steps: recognizer and converter. Essentially, the NN is utilized only
to distinguish which tokens belong to a number and which do not. After the model
has number candidates, they are transformed into the correct form by the set of
rules. Consequently, the model is able to read numbers in sentences accurately.
For example, suppose the input spoken sentence is: 'the population as of the
Canada twenty eleven census was one thousand one hundred twenty'. After passing
this sequence through the NN model, the output is formalized as: 'the population
as of the Canada < DATE > twenty eleven < /DATE > census was < CARDINAL > one
thousand one hundred twenty < /CARDINAL >'. As a result, the numerical phrases
'twenty eleven' and 'one thousand one hundred twenty' are wrapped and
transformed into written form by the rules, '2011' and '1120', based on the two
classes DATE and CARDINAL. The set of rules can be considered a replacement for
a great deal of knowledge that would otherwise have to be acquired during the
training process.
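As a minimal sketch of the second stage, the tagged output can be parsed and the
wrapped phrases converted; the toy lookup table below stands in for the real
rule set:

```python
import re

# Toy conversion rules; the real Number converter uses a much larger rule set.
SPOKEN_TO_WRITTEN = {
    "twenty eleven": "2011",
    "one thousand one hundred twenty": "1120",
}

def convert_tags(tagged: str) -> str:
    """Replace <CLASS>...</CLASS> spans with their written form.
    If a span has no matching rule, keep the spoken text unchanged,
    which is exactly the error-preserving behavior described above."""
    def repl(m):
        phrase = m.group(2).strip()
        return SPOKEN_TO_WRITTEN.get(phrase, phrase)
    return re.sub(r"<\s*(\w+)\s*>(.*?)<\s*/\1\s*>", repl, tagged)

tagged = ("the population as of the Canada <DATE> twenty eleven </DATE> "
          "census was <CARDINAL> one thousand one hundred twenty </CARDINAL>")
print(convert_tags(tagged))
# -> the population as of the Canada 2011 census was 1120
```

The backreference \1 in the pattern ensures an opening tag is only paired with
its own closing tag.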
Compared to the baseline model, my proposed model has the following two
advantages:
- In a low-resource scenario, an end-to-end model can be data-hungry, since it
has to deal with two difficult consecutive problems: how to recognize numbers
exactly and how to rewrite them in the right form. For complex numbers (very
long or very large), these problems are more serious, because if even one error
appears in either process, the model fails. In contrast, my proposal only
requires the seq2seq model to recognize numbers. This reduces the amount of
information the seq2seq model has to learn compared with the baseline model and
boosts the performance of the Number recognizer. For the second task, converting
the numbers precisely, using rules also gives clearer accuracy than having a
neural network learn this process.
- As mentioned in Chapter 2, one of the severe problems of using an end-to-end
model for ITN is the unrecoverable error. In my proposal, this phenomenon is
controllable. The main reason is that when the Number recognizer makes an error
(for example, < DATE > twenty-three point five < /DATE >), the Number converter
immediately raises an exception and does not parse it further; the text
'twenty-three point five' is preserved to avoid unpredictable and nonsensical
results. This behavior is very helpful in real production, because preserving
the input is preferable to converting it to wrong output.
However, when I put the baseline model and my proposal under a large-data
scenario, the strength of the neural network in the baseline model shows: the
model then has an abundant amount of data to train effectively. Therefore, I
consider the effectiveness of both models in the two situations above:
low-resource and rich-resource.
The remaining parts of this section are organized as follows: in section 3.3, I
introduce the pipeline I use to create the training pairs for the Number
recognizer; section 3.4 describes how I leverage backbone models (RNN-based,
Transformer-based, BERT-based, and RecogNum-BERT-based seq2seq models) to build
the Number recognizer; and section 3.5 shows the process of building the Number
converter.
3.3 Data creation process
Most research on the ITN problem reverses the pairs in a TN dataset to obtain
data for the ITN problem. In addition, since my proposal requires annotated
number tags on the decoder side, following the work in [7], I employ a novel
data generation pipeline for ITN using a TTS system. Fig. 3.3 shows the main
steps of my data creation process, described sequentially as follows:
Figure 3.3: The overview of the data creation pipeline for training the Number
recognizer.
Step 1: Crawling/downloading raw data from published websites.
Step 2: Cleaning raw data (e.g., removing HTML, CSS, and so on) and removing
noisy documents.
Step 3: Detecting non-standard words in a sentence, for instance, alphabetic
words (e.g., 'David') or numbers (e.g., '2017'). For alphabetic words, I split
them into characters (extreme subwords) and bound them with the tags < oov > and
< /oov >. For numerical words, I split them into digit sequences and bound them
with the tag pair < n > and < /n >.
- For English, < n > indicates the name of the semiotic class: DATE, CARDINAL,
ORDINAL, TELEPHONE, ...
- For Vietnamese, because the Viettel TTS system (detailed in step 4) does not
provide the particular name of each semiotic class, the generic tag < n > is
preserved.
Step 4: Passing the output sentences through TTS systems. I use the Google TTS
system for English and the Viettel TTS system for Vietnamese.
Step 5: The output sentences from the TTS systems are used as the target
sentences for the NN models. To create the source sentences, I remove
punctuation and the tag < n >, lowercase all tokens, and preserve only the tag
< oov >, so that the source resembles the spoken form.
Step 6: Saving the data to files.
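The source-sentence construction in Step 5 can be sketched as a small text
transformation; the tag names and the sample sentence are illustrative
assumptions:

```python
import re
import string

def make_source(target: str) -> str:
    """Build the source (spoken-form-like) sentence from a tagged target:
    drop numerical class tags, drop punctuation, lowercase, keep <oov> tags."""
    # Remove tags such as <DATE>, </DATE>, <n>, </n>, but keep <oov>, </oov>.
    s = re.sub(r"<\s*/?\s*(?!oov\b)\w+\s*>", " ", target)
    # Delete punctuation, keeping the characters used by the surviving tags.
    punct = "".join(c for c in string.punctuation if c not in "<>/")
    s = s.translate(str.maketrans("", "", punct))
    return " ".join(s.lower().split())

target = "The census of <DATE> twenty eleven </DATE> listed <oov> B o b </oov>."
print(make_source(target))
# -> the census of twenty eleven listed <oov> b o b </oov>
```

The negative lookahead (?!oov\b) is what spares the < oov > markers while every
numerical class tag is stripped.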
3.4 Number recognizer
In this subsection, I present how I build the Number recognizer. As mentioned
above, I model ITN as an MT problem in which the source is the spoken form and
the target is the text with detected segments. For the NN models, I implement
two main training strategies: RNN-based and transformer-based seq2seq models.
More specifically, the transformer-based seq2seq models can be further
categorized in two ways: 1) using the vanilla transformer as in [20] with random
initialization; 2) using an external pre-trained language model to initialize
the parameters of the encoder in the seq2seq model. With this division, I
organize this section into three main subsections: RNN-based and vanilla
transformer-based, BERT-based, and RecogNum-BERT-based.
3.4.1 RNN-based and vanilla transformer-based
For the RNN-based model, I employ a two-stacked bi-directional long short-term
memory network (Bi-LSTM) as the encoder and a two-stacked LSTM as the decoder.
For decoding, I implement an attention mechanism similar to [18] and [7] to
boost performance.
For the non-recurrent model, I implement a vanilla transformer based on the work
in [20], with 6 stacked layers in both the encoder and the decoder. The source
and target sentences are also segmented into subword sequences [12]. After
obtaining the output of the seq2seq model, I apply post-processing to produce
the final results.
I run these RNN-based and vanilla-transformer-based experiments with both the
baseline model and my proposed model. The details of the LSTM and Transformer
architectures are described in Chapter 2.
3.4.2 BERT-based
There are several paths to applying a pre-trained language model to improve the
ITN task. As the results in [7] show, the authors found that using BART [14] to
fine-tune both the encoder and the decoder did not increase performance, whereas
using BERT to extract rich embeddings and fuse them into each layer of both the
encoder and the decoder improved accuracy slightly, but is complex and difficult
to implement. Inspired by the works in [21], I build a seq2seq model in which
the parameters and tokenizer of the encoder come from bert-base-uncased [15].
Having been trained on tremendous amounts of data, BERT may help the model
better recognize numerical tokens. The encoder is made of 12 layers with a
hidden state of 768 units. For consistency, the decoder is constructed with 6
layers. Similar to the works in [7], I use only bert-base-uncased in my
experiments. Besides, I also trained my model with an alternative language
model, RoBERTa [23], but did not see any improvement.
The training and inference processes are illustrated in detail in figure 3.4. In
the training stage, the parameters of BERT are the starting point of the
encoder, while the parameters of the decoder are initialized randomly. All
parameters are trained simultaneously to create the Number recognizer. In the
inference process, the output of the Number recognizer is further rewritten by
the Number converter to produce the final results.
3.4.3 RecogNum-BERT-based
The BERT model presented in Chapter 2 provides the model with contextual
information about all input tokens. To further investigate the efficiency of
using a pre-trained language model to enhance the Number recognizer in stage 1,
I propose a novel pre-trained language model that provides specific information
about the positions of numerical entities in text. I call it RecogNum-BERT.
Essentially, RecogNum-BERT is a BERT-like model supplied with additional
information about number recognition.
Figure 3.4: The training and inference processes when applying BERT to
initialize the encoder of the proposed model.
By masking a small number of tokens and using the output hidden states of the
final layer to predict the absent tokens, the original BERT learns contextual
embeddings of tokens conditioned on the other tokens in the sentence. Likewise,
to provide information about the appearance of numerical tokens, I detect and
save the positions of the number tokens and remove them from the initial
sentence. Therefore, only alphabetic words are kept and become the input of the
pre-trained language model. The token embeddings are passed through the
pre-trained model to yield hidden states. Finally, each hidden state is fed into
a binary classifier that predicts whether the corresponding token is followed by
the start of a numerical phrase. In this way, the model is trained to detect the
positions of numbers based on the values of the other tokens in a sentence.
The training process can be divided into two main steps: 1) fine-tuning BERT
with an additional loss to create the new variant, RecogNum-BERT, and 2)
leveraging it for training the seq2seq model.
The overview of the training and inference processes of my proposal is shown in
figure 3.5. In particular, the training stage consists of two crucial steps. The
first step is fine-tuning BERT with an additional loss over the modified data to
produce RecogNum-BERT. Subsequently, RecogNum-BERT is further fine-tuned as part
of my seq2seq architecture to create the Number recognizer.
The inference stage is designed similarly to the process in the previous
section.
Figure 3.5: The overview of the training and inference processes of my proposed
model when integrating RecogNum-BERT.
3.4.3.1 Fine-tuning RecogNum-BERT
Aiming to supply information about the appearance of numerical tokens in a
sentence, I create RecogNum-BERT by fine-tuning bert-base-uncased [15] on a
modified data set with a modified loss function.
Modified loss function: The final hidden vector of every token is fed into a
binary classification layer with sigmoid activation. I denote the label of token
i as y_i. The label y_i = 1 indicates that token i stands right in front of a
numerical entity in the original sentence, and y_i = 0 otherwise. This auxiliary
loss is called the number recognizing loss. To train RecogNum-BERT, the overall
loss function is designed as the sum of the conventional masked token prediction
loss and the number recognizing loss. I also consider training RecogNum-BERT
with only the numerical loss. In this process, I ignore the next sentence
prediction loss of the traditional BERT. The overview of the proposed
architecture is described in figure 3.6.
Modified dataset: The process of creating samples for fine-tuning RecogNum-BERT
consists of the following consecutive steps:
- Collecting raw data in written form.
- Preprocessing the raw text (tokenizing, lowercasing, cleaning off punctuation,
removing noise words, ...) to fit the spoken form.
- Detecting the numerical entities in the text and keeping the positions of the
tokens standing right in front of the numbers; removing all number tokens from
the original tokens to form the input text, and saving the list of positions as
the labels.
- Pairing the input text with its labels as a training sample for the model.
Figure 3.6: The architecture for creating RecogNum-BERT.
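These steps can be sketched as follows; the whitespace tokenizer and the
isdigit number test are simplifying assumptions:

```python
def make_recognum_sample(sentence: str):
    """Remove number tokens; label each kept token 1 if it originally
    stood right in front of a numerical token, else 0."""
    tokens = sentence.split()
    kept, labels = [], []
    for idx, tok in enumerate(tokens):
        if tok.isdigit():              # simplified numerical test
            continue
        next_is_number = idx + 1 < len(tokens) and tokens[idx + 1].isdigit()
        kept.append(tok)
        labels.append(1 if next_is_number else 0)
    return kept, labels

text = "the census of 2011 counted 1120 records"
print(make_recognum_sample(text))
# -> (['the', 'census', 'of', 'counted', 'records'], [0, 0, 1, 1, 0])
```

The kept tokens form the model input, and the 0/1 labels supervise the binary
classifier of the number recognizing loss.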
3.5 Number converter
The output of NN models with detected segments is transformed into the written
form using a set of rules: the Number converter. Tab. 3.1 demonstrates an
example with the input is the output sequence of the first stage and final outputs
of this process. In this thesis, I concentrate on building a set of rules for converting
numbers for both languages: English and Vietnamese.
English: I use the python libary word2number1 package
1
for converting spoken
numbers into written numbers in English. In particular, since the tool works only
for positive numbers and the largest value is limited to 999,999,999,999, I ex-
tended the tool in order to handle negative cardinals and larger numbers. More-
1
https://pypi.org/project/word2number/
28
Figure 3.7: The pipeline for data preparation for fine-tuning RecogNum-Bert.
Spoken-form Label After Label Written-form
he NONE he he
collected NONE collected collected
four CARDINAL start
CARDINAL
400000
hundred CARDINAL in four hundred thousand
thousand CARDINAL end
/CARDINAL
records NONE records records
Table 3.1: An example output in Number converter.
over, I also construct the Python modules for reading complex number, which
belongs to other classes such as MEASURE, DATE, PHONE, TIME... and so on
based on the aforementioned extended tool.
Vietnamese: I use the open-source Python library vietnam-number (https://pypi.org/project/vietnam-number/), which supports number processing for Vietnamese. This library covers a wide range of reading styles: single reading (one digit at a time), double reading (two digits at a time), huge numbers (up to 999 999 999 999), the informal reading style, and so on. Similar to English, I also extend the range of numbers the tool can handle and incorporate it into the specific numerical classes for Vietnamese.
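As a toy illustration of the "single-reading" style (one digit at a time, used for phone numbers and similar sequences), the following sketch maps Vietnamese digit words to digits. This is my own simplified example, not the vietnam-number implementation or API.

```python
VI_DIGITS = {"không": 0, "một": 1, "hai": 2, "ba": 3, "bốn": 4,
             "năm": 5, "sáu": 6, "bảy": 7, "tám": 8, "chín": 9}

def vi_single_reading(text):
    """Convert a Vietnamese digit-by-digit sequence (phone numbers,
    lottery numbers, ...) into its written form."""
    return "".join(str(VI_DIGITS[w]) for w in text.lower().split())

# e.g. vi_single_reading("không tám hai hai") -> "0822"
```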
Chapter 4
Experiment
4.1 Datasets
Regarding evaluation datasets, I test my method on two datasets in different languages, English and Vietnamese, which are extracted from publicly available data sources as follows:
English Dataset: the original version consists of 1.1 billion words of English text from Wikipedia, divided across 100 files. The normalized text is obtained by running the data through the Kestrel text normalization system of the Google TTS system [2]. In this study, I use the first file, which contains approximately 4.4 million samples (4,401,098 samples), for my experiments. I randomly split the file and extract two sub-parts: the first part includes 1 million sentences for training, and the second part contains 50,000 sentences for testing. Following previous research on the ITN problem, I preserve all tokens of each sentence and swap the input and output.
Vietnamese Dataset: the raw dataset is extracted from a large published source (https://github.com/binhvq/news-corpus). After that, I decode it as UTF-8 and remove all sentences containing tags (e.g., HTML and CSS). I also extract 1 million and 50,000 sentences for training and testing data, respectively. I then run the dataset through the Viettel TTS system (https://viettelgroup.ai/). To format it into an ITN dataset, I construct the data following the pipeline in Section 3.2.
Because this study considers the effect of my proposed model in both low-resource and rich-resource scenarios, I divide the training data into various sizes (100k, 200k, 500k, and 1000k samples) in order to evaluate the advantage of the proposed method. Specifically, Tab. 4.1 shows the size of the training set, the size of the validation set, the vocabulary size (the number of unique tokens), and the average input sequence length for both datasets. Overall, the vocabulary size of English is significantly greater than that of Vietnamese because the token unit in English is the word, while in Vietnamese it is the syllable. Besides, sentences in English are remarkably shorter than those in Vietnamese (approximately 19.4 tokens on average in English compared to 28.6 in Vietnamese).
Language     Dataset   Training   Validation   Vocabulary Size   Avg Seq Len (input)
English      100k      80k        20k          35903             19.4
English      200k      160k       40k          47328             19.5
English      500k      400k       100k         65079             19.3
English      1000k     800k       200k         80585             19.4
Vietnamese   100k      80k        20k          7223              28.7
Vietnamese   200k      160k       40k          8225              28.8
Vietnamese   500k      400k       100k         10400             28.6
Vietnamese   1000k     800k       200k         11657             28.6

Table 4.1: The training size, validation size, vocabulary size, and average input sequence length of my datasets.
4.2 Hyper-parameter configurations
4.2.1 RNN-based and vanilla Transformer-based configurations
For both the baselines and the proposed model, I use the same settings for the RNN-based and vanilla Transformer-based models. Specifically, neural ITN is treated as an MT problem. Furthermore, all models are implemented with the subword approach, which has been shown to give better performance [12]. The baseline models are configured as follows:
RNN Model: For the recurrent seq2seq baseline model, I use an Encoder-Decoder architecture whose encoder consists of two bi-directional long short-term memory (Bi-LSTM) layers, with two LSTM layers for the decoder. Both the encoder and decoder contain 512 hidden states. The global attention mechanism [18] is implemented in the decoder.
Transformer Model: For the non-recurrent seq2seq baseline model, I implement an architecture similar to [20]. Specifically, I employ the subword Transformer model with 6 layers for both the encoder and the decoder. Each sub-layer block has a 512-dimensional hidden state, and the number of self-attention heads is set to 8.
Subsequently, I run my proposed method in two versions by adopting the two aforementioned baseline models for the first-stage segment detection, respectively. For the hyperparameter configuration, I use the Adam optimizer with learning-rate annealing and an initial value of 0.01, and the dropout is set to 0.1. All models are trained for 100k steps, with early stopping on the validation loss.
4.2.2 BERT-based and RecogNum-BERT-based configurations
To leverage a pre-trained language model, I use the bert-base-uncased [24] model and a variant of it, named RecogNum-BERT. Both pre-trained language models are used to initialize the encoder and are fine-tuned simultaneously with the other parameters of the Seq2Seq model.
For fine-tuning bert-base-uncased, I use a training batch size of 10,000 tokens per batch, 6 decoder layers, a dropout of 0.1, and a learning rate of 0.001.
To create the modified data for fine-tuning RecogNum-BERT, I use the full dataset mentioned in Section 4.1, excluding the data belonging to the test set, and shuffle it once more to create a new dataset for this task. I then remove the noisy samples that cause errors with the bert-base-uncased tokenizer. Finally, I extract more than 3.9 million sentences for the training set and 5,000 samples for the validation set.
For training RecogNum-BERT, I tune the parameters of bert-base-uncased until the lowest validation loss is obtained; the learning rate in this process is set to 0.001. I then train RecogNum-BERT together with the other parameters of the decoder, with the batch size, learning rate, and dropout identical to those of bert-base-uncased.
4.3 Evaluation metrics
Following the evaluation in [7], I report three metrics to demonstrate the efficiency of my proposed model: BLEU, WER, and NP.
4.3.1 Bilingual evaluation understudy (BLEU)
BLEU is a popular metric for measuring the performance of a machine translation system, calculated as:

BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n )   (4.1)

where p_n is the modified n-gram precision, N is the maximum n-gram order (usually set to 4), and w_n denotes the uniform weight w_n = 1/N; in the case N = 4, w_n = 1/4. BP refers to the brevity penalty, calculated as:

BP = 1 if c > r, and BP = e^(1 − r/c) otherwise.   (4.2)

where c and r refer to the lengths of the candidate and reference sentences, respectively. BLEU ranges between 0 and 1; higher is better.
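Equations (4.1) and (4.2) can be sketched for a single candidate/reference pair as follows. This is an unsmoothed, single-reference illustration with an assumed function name, not a full corpus-level BLEU implementation.

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """BLEU = BP * exp(sum_n w_n * log p_n) with uniform w_n = 1/max_n."""
    c_tok, r_tok = candidate.split(), reference.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(c_tok[i:i + n])
                           for i in range(len(c_tok) - n + 1))
        r_ngrams = Counter(tuple(r_tok[i:i + n])
                           for i in range(len(r_tok) - n + 1))
        # modified n-gram precision: candidate counts clipped by reference
        overlap = sum(min(cnt, r_ngrams[g]) for g, cnt in c_ngrams.items())
        total = sum(c_ngrams.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU = 0
        log_sum += math.log(overlap / total) / max_n
    c, r = len(c_tok), len(r_tok)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty, Eq. (4.2)
    return bp * math.exp(log_sum)
```

An identical candidate and reference scores 1.0; a candidate sharing no unigram with the reference scores 0.0.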
4.3.2 Word error rate (WER)
WER is a common metric for the performance of speech recognition or machine translation systems, calculated as:

WER = (S + D + I) / (S + D + C)   (4.3)

where S is the number of substituted words, D is the number of deleted words, I is the number of inserted words, and C is the number of correct words. In contrast to BLEU, WER is also greater than 0 but can grow toward positive infinity; lower is better.
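Eq. (4.3) is equivalent to the word-level Levenshtein distance divided by the reference length (S + D + C). A compact dynamic-programming sketch (names are illustrative):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(r)][len(h)] / len(r)

# e.g. wer("a b c", "a x c") -> 1/3 (one substitution over three words)
```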
4.3.3 Number precision (NP)
NP computes the ratio of numerical entities precisely predicted by my model over the total number of numerical entities in the whole test set. The formulation is:

NP = C / T   (4.4)

where C is the number of numerical objects that are correctly predicted, and T is the total number of numerical objects in the test set.
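A simplified sketch of how NP could be computed is shown below. The exact entity-matching procedure is not specified in this thesis, so the regex and the position-wise pairing of numbers here are my assumptions.

```python
import re

def number_precision(predictions, golds):
    """NP = C / T: fraction of numerical entities in the gold outputs
    that the model reproduced exactly (numbers paired position-wise)."""
    correct = total = 0
    for pred, gold in zip(predictions, golds):
        p_nums = re.findall(r"\d[\d.,$%/-]*", pred)
        g_nums = re.findall(r"\d[\d.,$%/-]*", gold)
        total += len(g_nums)
        correct += sum(p == g for p, g in zip(p_nums, g_nums))
    return correct / total if total else 0.0
```

For example, if the model outputs "11.9 people" where the gold answer is "111.9 people", that entity counts as incorrect.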
4.4 Result and Analysis
To show off the efficiency of my proposed model in comparison to the baseline, I
organize this section into two parts:
The experiments without pre-trained LM: I use the shared backbone: RNN-
based model and Vanilla Transformers-based model and apply them to both
the baseline model and the number recognizer of my proposal. The experi-
ments are conducted over all the ranges of data from 100k to 1m to see the
fluctuation of results and the correlation of it on a such volume of datasets
in English and Vietnamese. The scores used in this experiment are BLEU,
WER, and NP. I provide the evaluation for separate modules in tables: 4.2
and the results of hybrid model the in tables: ??, ??.
The experiments with a pre-trained LM: I apply BERT and RecogNum-BERT to the Number Recognizer. With BERT, I examine whether transferring knowledge from a pre-trained model to the seq2seq model benefits the ITN problem. With RecogNum-BERT rather than BERT, I test whether supplementing pre-trained BERT with information about detecting numerical tokens, and transferring this knowledge, benefits the overall model. The metrics used are BLEU and WER. In this experiment, I do not consider the Vietnamese dataset.
4.4.1 Experiments without pre-trained LM
4.4.1.1 Results of separate modules
a) Results of the Number Recognizer: To evaluate the performance of the Number Recognizer, I report the results of the seq2seq model on the validation set with the BLEU score (Table 4.2).
Validation/Train set   English                Vietnamese
                       RNN      Transformer   RNN      Transformer
20k / 80k              0.9028   0.8396        0.855    0.8327
40k / 160k             0.9128   0.8888        0.8635   0.8928
100k / 400k            0.915    0.9025        0.8591   0.8986
200k / 800k            0.917    0.9323        0.8559   0.8988

Table 4.2: The results of the Number Recognizer on the validation set with the BLEU score.
Accordingly, there are several crucial things that I can conclude as follows:
Overall, the Number Recognizer achieves good quality with both the RNN and Transformer approaches. The BLEU score of the module increases gradually as additional training data is supplied. When the data volume reaches its peak (1M samples), the Number Recognizer obtains a 0.9323 BLEU score with the Transformer model in English and 0.8988 in Vietnamese.
Regarding the English dataset, the RNN-based Number Recognizer performs better than the Transformer in the range from 100k to 200k samples, which demonstrates the advantage of the RNN-based model over the Transformer-based model in low-resource settings. By contrast, when the volume of data increases (from 500k to 1M), the Transformer-based model outperforms the RNN-based one. Besides, as data is gradually added to the Number Recognizer, the performance of the RNN plateaus at approximately 0.91, while the Transformer-based model still improves (around 0.93).
In terms of the Vietnamese dataset, I observe a similar trend. However, the performance on Vietnamese is lower than on English in all cases. The crucial reason behind this phenomenon is that the average sequence length of Vietnamese samples is longer than that of English samples (Table 4.1), which may negatively influence the performance of the seq2seq model.
b) Results of the Number Converter: A validation dataset for measuring the performance of the Number Converter has to meet the following requirements:
The input sentence must supply two crucial pieces of information: the content of the entities (e.g., twenty-two, November, ten dollars) and the numerical class that content belongs to (e.g., DATE, TIME, MEASURE, ...).
The gold output must supply the written form of the corresponding numerical entities (e.g., 22, $10, ...).
Unfortunately, no available dataset satisfies the aforementioned criteria. Therefore, I do not conduct an experiment on the Number Converter in isolation; the quality of this module is demonstrated through the performance of the hybrid model.
4.4.1.2 Results of hybrid model
Figures 4.1, 4.2, and 4.3 show the comparison results of my experiments on the English test set in BLEU, WER, and NP score, respectively.
Figure 4.1: The comparison of models on English test set with BLEU score
(higher is better).
Note that for the BLEU and NP scores, higher is better, while for the WER score, lower is better. Accordingly, several observations can be summarized from the results as follows:
Figure 4.2: The comparison of models on English test set with WER score (lower
is better).
For the smallest amount of data (100k), my method with the RNN backbone achieves the best score in all criteria: 0.8334 for BLEU, 0.1017 for WER, and 0.8566 for NP. In both the baseline and my proposal, the RNN surpasses the Transformer.
For 200k samples, my proposed model with the RNN obtains the highest BLEU score (0.8353), whereas the version with the Transformer model achieves the best WER (0.1003) and the highest NP (0.8611).
In the range from 100k to 200k samples, my proposed model shows an advantage over the baseline. Additionally, the RNN-based backbone benefits the model significantly more than the Transformer-based one on the smallest dataset (100k). As the amount of data progressively increases, the Transformer-based model becomes better than the RNN-based one.
On the 500k dataset, the Transformer dominates in both the baseline model and my proposed model. In particular, the best performances are a 0.8558 BLEU score and a 0.0708 WER score for the baseline model, while my proposed model achieves the best NP score: 0.8687. At this data volume, the baseline with only the seq2seq model starts to overtake my proposed model. This phenomenon shows that with rich resources (more than 500k samples in English), the strength of the neural network is revealed; my proposed model, whose second stage is a Number
Figure 4.3: The comparison of models on English test set with NP score (higher
is better).
Converter built from a set of rules, generalizes less well and is at a disadvantage compared to the end-to-end model.
On the largest data volume (1M samples), the best values of all criteria (BLEU, WER, and NP) are achieved by the Transformer-based baseline model. These values are not only remarkably better than those of the RNN-based model but also slightly better than the best figures from my proposed model: 0.9138 vs. 0.8933 in BLEU, 0.0405 vs. 0.0517 in WER, and 0.9318 vs. 0.9105 in NP.
Based on the aforementioned results, the three evaluation metrics show a high correlation when assessing performance on the ITN problem. Most importantly, 500k samples can be considered the threshold distinguishing low-resource from rich-resource settings for the English dataset.
The results of my experiments on Vietnamese are presented in Figures 4.4, 4.5, and 4.6. Several remarkable points can be highlighted as follows:
The 100k dataset shows the best results for my proposed model. Specifically, the best BLEU score is 0.7019 and the best WER is 0.1897 when using the RNN model, whereas the best NP is 0.6359 when using the Transformer model. Notably, the gap between the results of the two methods
Figure 4.4: The comparison of models on the Vietnamese test set with BLEU
score (higher is better).
is very large. In particular, for the BLEU score, my RNN-based model is significantly higher than the baseline model, by 0.025; the corresponding gaps for WER and NP are 0.06 and 0.3, respectively. These large margins are clear evidence of the efficiency of my approach when the number of samples is extremely small.
For the 200k dataset, my method with the Transformer model leads in all criteria: 0.7286 for BLEU, 0.1774 for WER, and 0.6775 for NP. As for the baseline, although it reaches an approximately equal BLEU value, its figures are notably worse than those of my method, by roughly 0.04 in WER and 0.2 in NP.
Similar trends appear in the figures for the 500k and 1M datasets. First, when applying the RNN model and providing more data to both the baseline model and my approach, I see no improvement, which indicates the limited ability of the RNN model when applied to Vietnamese. Leveraging the Transformer, my method still surpasses the rest: it achieves 0.7885 for the BLEU score, 0.1199 for the WER score, and 0.699 for the NP score. Compared to the baseline, my results are still considerably better, by 0.03 in BLEU, 0.06 in WER, and 0.1 in NP.
When comparing the performance of the baseline and proposed models between Vietnamese and English, it is clear that the outcome of the models in Vietnamese
Figure 4.5: The comparison of models on the Vietnamese test set with WER score
(lower is better).
tends to be lower than in English. For instance, in the case of 1M samples, my proposed model with the Transformer obtains only a 0.7885 BLEU score, compared to 0.8933 in English. The lower results on the Vietnamese test set may come from several issues:
The average sequence length of Vietnamese samples is rather long (see Table 4.1), which may strongly influence the performance of the seq2seq model on the test set.
On the Vietnamese test set, the seq2seq model may produce repeated words, which severely harms the performance of the hybrid model.
On the Vietnamese test set, one number can sometimes be written in several valid forms, for instance '500 nghìn' and '500.000' (five hundred thousand). Although these carry the same meaning, the difference worsens the BLEU and WER scores.
Some failure examples in Vietnamese can be seen in Table 4.6.
In both English and Vietnamese, the recurrent seq2seq models with attention achieve better performance than the Transformer in low-resource scenarios, while Transformer-based models achieve the best results as the number of training samples increases. Therefore, combining the two methods (hybrid models) could further improve performance; I leave this as future work for this study.
Figure 4.6: The comparison of models on the Vietnamese test set with NP score
(higher is better).
4.4.2 Experiments with pre-trained LM
Table 4.3 reports the BLEU scores of four approaches to building my Number Recognizer with different model types: the vanilla Transformer, BERT, RecogNum-BERT with numerical loss + masked-token prediction loss, and RecogNum-BERT with numerical loss only. Looking at the table, I can highlight some remarkable points as follows:
It is clear that when the data ranges from 100k to 500k samples, my method with BERT and the other BERT variants shows a significant improvement over using only the vanilla Transformer model. This difference is largest when the data has only 100k samples and gradually decreases as data is added. This evidence shows the efficiency of my proposal in taking advantage of a pre-trained language model to improve the quality of the Number Recognizer module in low-resource settings.
Nevertheless, similar to the results in Section 4.4.1, when the dataset is big enough, the knowledge from the pre-trained language model may no longer be helpful for this module. This is clear from the final column (1M), in which the BLEU score of the vanilla Transformer peaks at 0.8933, while the figures for the three remaining methods are only 0.8794, 0.8781, and 0.8801, respectively.
Looking at the performance of the three pre-trained model variants over the whole range
Training Data                              100k     200k     500k     1m
Our-Vanilla Transformer                    0.741    0.8188   0.8394   0.8933
Our-BERT                                   0.8697   0.8772   0.8779   0.8794
Our-RecogNum-BERT (num loss + mask loss)   0.8554   0.8738   0.8759   0.8781
Our-RecogNum-BERT (num loss)               0.8662   0.8782   0.879    0.8801

Table 4.3: Comparison between variants of the proposed method with different encoders for the Number Recognizer module (Transformer Base, BERT, RecogNum-BERT) in BLEU score.
of data, RecogNum-BERT with two losses brings the least benefit. Meanwhile, RecogNum-BERT with the numerical loss only is comparable to BERT on the 100k dataset and surpasses all others on the remaining datasets. Although the improvement is minimal, the additional numerical-loss prediction also proves its contribution to building a better Number Recognizer.
Training Data                              100k      200k     500k     1m
Our-Vanilla Transformer                    0.152     0.1003   0.0874   0.0517
Our-BERT                                   0.06195   0.0565   0.0542   0.0535
Our-RecogNum-BERT (num loss + mask loss)   0.0783    0.059    0.0569   0.05398
Our-RecogNum-BERT (num loss)               0.0666    0.0550   0.0538   0.05295

Table 4.4: Comparison between variants of the proposed method with different encoders for the Number Recognizer (vanilla Transformer, BERT, RecogNum-BERT) in WER score.
Table 4.4 expresses my results in WER score for all approaches over all data sizes. According to this table, I can easily see the following points:
There is a strong correlation between the BLEU and WER scores, as a similar trend appears in both criteria. With low resources such as 100k, 200k, and even 500k samples, applying BERT and its variants still brings considerable improvement.
Although the improvement is not large, adding the numerical loss to the original BERT demonstrates its efficiency in advancing the performance of the whole model.
4.5 Visualization
In this section, I visualize some English examples in both spoken form and written form (Table 4.5). Besides, some samples that are predicted wrongly in Vietnamese can be found in Table 4.6.
Spoken-form                                    Baseline method   Our method
up to five hundred thousand dollars            up to $5000000    up to $500000
one hundred eleven point nine people           11.9 people       111.9 people
nine trillion seven hundred eighty one
billion nine hundred eight million four
hundred forty three thousand seven             97819984437       9781908443007
four million eight hundred eleven
thousand one                                   48111             4811001

Table 4.5: Examples of prediction errors of the baseline model in English.
Each row below lists the spoken form (Vietnamese, with an English gloss), my model's prediction, and the gold answer.

Row 1:
  Spoken-form: dãy số may mắn không một không bảy không tám không chín
    (the lucky number sequence is zero one zero seven zero eight zero nine)
  Our method:  dãy số may mắn 1789 (the lucky number sequence is 1789)
  Gold answer: dãy số may mắn 01 07 08 09 (the lucky number sequence is 01 07 08 09)

Row 2:
  Spoken-form: mất từ sáu trăm bảy trăm năm mươi nghìn
    (lost from six hundred to seven hundred and fifty thousand)
  Our method:  mất từ 600 750000 (lost from 600 750000)
  Gold answer: mất từ 600 750 nghìn (lost from 600 750 thousand)

Row 3:
  Spoken-form: số điện thoại không tám hai hai một năm hai ba sáu chín
    (the phone number is zero eight two two one five two three six nine)
  Our method:  số điện thoại 8 2215 2369 (the phone number is 8 2215 2369)
  Gold answer: số điện thoại (08)22152369 (the phone number is (08)22152369)

Row 4:
  Spoken-form: sáu đến mười một phần tinh bột (six to eleven parts starch)
  Our method:  6-11phầntinhbột (tokens merged: 6-11partsstarch)
  Gold answer: 6-11 phần tinh bột (6-11 parts starch)

Table 4.6: Examples of error predictions of the proposed model in Vietnamese.
Chapter 5
Conclusion
5.1 Summary
In this study, I introduce a new method for the neural ITN approach. Specifically, unlike previous works, I divide the neural ITN problem into two stages. In the first stage, neural models are used to detect numerical segments (the Number Recognizer); subsequently, in the second stage, the written form is extracted using a set of rules (the Number Converter). In this regard, my method is able to deal with low-resource scenarios, where little training data is available. Furthermore, I showed that my method can easily be extended to other languages without requiring linguistic knowledge. The evaluation on two datasets in different languages (English and Vietnamese) with different numbers of training samples (100k, 200k, 500k, and 1000k) indicates that my method achieves comparable results in English and the highest results in Vietnamese.
Moreover, by leveraging the strength of a pre-trained language model (BERT), I improve performance by using its parameters to initialize the Number Recognizer. In addition, to further explore the effect of additional knowledge about the appearance of numerical entities, I propose a new variant of BERT, called RecogNum-BERT, apply it successfully to the Number Recognizer, and observe small improvements.
Finally, to my knowledge, my work is the first study that considers the ITN problem in the low-resource data scenario. I hope this work can promote research interest in enhancing the performance of ITN tasks in Vietnamese and other low-resource languages.
5.2 Future work
Regarding the future work of this study, I have some ideas to further advance the quality of my proposed model:
As shown by the results in Section 4.4.2, transferring knowledge from a pre-trained language model into the Number Recognizer brings large advantages. However, in this work I only use a well-known general model, BERT, as the backbone of my experiments. BERT is trained on corpora from many domains and is able to generalize to many NLP tasks; this trait also makes it less competitive than a pre-trained model dedicated to a single domain. Therefore, I would like to build a pre-trained language model whose data comes entirely from the spoken-language domain, collected from the output of an ASR system. Nevertheless, this process can cost substantial computational resources and time.
REFERENCES
[1] S. Pramanik and A. Hussain, “Text normalization using memory augmented
neural networks”, Speech Commun., vol. 109, pp. 15–23, 2019. DOI: 10.
1016/j.specom.2019.02.003.
[2] P. Ebden and R. Sproat, “The kestrel TTS text normalization system”, Nat.
Lang. Eng., vol. 21, no. 3, pp. 333–353, 2015. DOI: 10.1017/S1351324914000175.
[3] Y. Zhang, E. Bakhturina, and B. Ginsburg, “NeMo (inverse) text normalization: From development to production”, in Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, H. Hermansky, H. Černocký, L. Burget, L. Lamel, O. Scharenborg, and P. Motlíček, Eds., ISCA, 2021, pp. 4857–4859. [Online]. Available: http://www.isca-speech.org/archive/interspeech_2021/zhang21ja_interspeech.html.
[4] H. Zhang, R. Sproat, A. H. Ng, et al., “Neural models of text normalization
for speech applications”, Comput. Linguistics, vol. 45, no. 2, pp. 293–337,
2019. DOI: 10.1162/coli\_a\_00349.
[5] E. Pusateri, B. R. Ambati, E. Brooks, O. Plátek, D. McAllaster, and V. Nagesha, “A mostly data-driven approach to inverse text normalization”, in Proceedings of the 18th Annual Conference of the International Speech Communication Association (Interspeech), ISCA, 2017, pp. 2784–2788.
[6] M. Ihori, A. Takashima, and R. Masumura, “Large-context pointer-generator networks for spoken-to-written style conversion”, in Proceedings of the 45th International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 8189–8193. DOI: 10.1109/ICASSP40776.2020.9053930.
[7] M. Sunkara, C. Shivade, S. Bodapati, and K. Kirchhoff, “Neural inverse text normalization”, in Proceedings of the 46th International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 7573–7577. DOI: 10.1109/ICASSP39728.2021.9414912.
[8] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks”, in Advances in Neural Information Processing Sys-
tems 27: Annual Conference on Neural Information Processing Systems
2014, December 8-13 2014, Montreal, Quebec, Canada, Z. Ghahramani,
M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014,
pp. 3104–3112. [Online]. Available: https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html.
[9] R. Sproat and N. Jaitly, “An RNN model of text normalization”, in Proceedings of the 18th Annual Conference of the International Speech Communication Association (Interspeech), ISCA, 2017, pp. 754–758.
[10] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition”, in Proceedings of the 41st International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 4960–4964. DOI: 10.1109/ICASSP.2016.7472621.
[11] S. Yolchuyeva, G. Németh, and B. Gyires-Tóth, “Text normalization with convolutional neural networks”, Int. J. Speech Technol., vol. 21, no. 3, pp. 589–600, 2018. DOI: 10.1007/s10772-018-9521-x.
[12] C. Mansfield, M. Sun, Y. Liu, A. Gandhe, and B. Hoffmeister, “Neural
text normalization with subword units”, in Proceedings of the 17th Annual
Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies (NAACL-HLT), Associ-
ation for Computational Linguistics, 2019, pp. 190–196. DOI: 10.18653/
v1/N19-2024.
[13] T. Kudo and J. Richardson, “SentencePiece: A simple and language inde-
pendent subword tokenizer and detokenizer for neural text processing”, in
Proceedings of the 2018 Conference on Empirical Methods in Natural Lan-
guage Processing: System Demonstrations, Brussels, Belgium: Association
for Computational Linguistics, Nov. 2018, pp. 66–71. DOI: 10.18653/v1/D18-2012. [Online]. Available: https://aclanthology.org/D18-2012.
[14] M. Lewis, Y. Liu, N. Goyal, et al., “BART: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and comprehen-
sion”, in Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, Online: Association for Computational Linguistics, Jul. 2020, pp. 7871–7880. DOI: 10.18653/v1/2020.acl-main.703. [Online]. Available: https://aclanthology.org/2020.acl-main.703.
[15] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding”, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019, pp. 4171–4186.
[16] S. Clinchant, K. W. Jung, and V. Nikoulina, “On the use of BERT for neu-
ral machine translation”, in Proceedings of the 3rd Workshop on Neural
Generation and Translation, Hong Kong: Association for Computational
Linguistics, Nov. 2019, pp. 108–117. DOI: 10.18653/v1/D19-5611.
[Online]. Available: https://aclanthology.org/D19-5611.
[17] A. Graves, “Long short-term memory”, in Supervised Sequence Labelling with Recurrent Neural Networks, Springer, 2012, pp. 37–45.
[18] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation”, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), The Association for Computational Linguistics, 2015, pp. 1412–1421. DOI: 10.18653/v1/d15-1166.
[19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly
learning to align and translate”, in 3rd International Conference on Learn-
ing Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Con-
ference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online].
Available: http://arxiv.org/abs/1409.0473.
[20] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need”, in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
[21] J. Zhu, Y. Xia, L. Wu, et al., “Incorporating BERT into neural machine translation”, in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. [Online]. Available: https://openreview.net/forum?id=Hyl7ygStwB.
[22] M. E. Peters, M. Neumann, M. Iyyer, et al., “Deep contextualized word
representations”, in Proceedings of the 2018 Conference of the North Amer-
ican Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana:
Association for Computational Linguistics, Jun. 2018, pp. 2227–2237. DOI: 10.18653/v1/N18-1202. [Online]. Available: https://aclanthology.org/N18-1202.
[23] Y. Liu, M. Ott, N. Goyal, et al., “RoBERTa: A robustly optimized BERT pretraining approach”, CoRR, vol. abs/1907.11692, 2019. arXiv: 1907.11692. [Online]. Available: http://arxiv.org/abs/1907.11692.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of
deep bidirectional transformers for language understanding”, in Proceed-
ings of the 2019 Conference of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Language Technologies, Vol-
ume 1 (Long and Short Papers), Minneapolis, Minnesota: Association for
Computational Linguistics, Jun. 2019, pp. 4171–4186. DOI: 10.18653/v1/N19-1423. [Online]. Available: https://aclanthology.org/N19-1423.